POLI 572B

Michael Weaver

March 2, 2020

Inference with Least Squares

Objectives

Recap

  • Least Squares as a predictive algorithm

Inference

  • Least Squares as an estimator of some unknown parameters

Today

Least squares as

  • algorithm for prediction
    • for “machine learning” to predict \(E(Y | X)\)
    • where we care about the accuracy of \(\widehat{Y}\) as an estimate of \(E(Y | X)\), not about the correctness of \(\beta\)

Today

Least Squares

  • statistical model for inference
    • what is the statistical model? (like, e.g. Neyman Causal Model)
    • what are the assumptions we need for…
      • unbiasedness of estimator
      • deriving uncertainty of estimate

Recap

Least Squares for Prediction

So far, we…

  • …start with a vector \(\mathbf{Y} = (Y_1 \ldots Y_i \ldots Y_n)\)

  • … want to produce a vector of predicted values \(\mathbf{\widehat{Y}} = (\widehat{Y_1} \ldots \widehat{Y_i} \ldots \widehat{Y_n})\)

  • … define \(\mathbf{\widehat{Y}} = \mathbf{X}\mathbf{\beta}\):

    • where \(\mathbf{X}\) is a matrix of predictors (for the mean, just a vector of \(1\)s) and \(\mathbf{\beta}\) is a vector of linear coefficients that are multiplied by \(\mathbf{X}\) such that \(\widehat{Y}_i = \beta_0x_{i0} + \beta_1x_{i1} + \ldots + \beta_px_{ip}\)

Least Squares for Prediction

Then, we…

  • … choose \(\mathbf{\widehat{Y}}\) to have minimum (Euclidean) distance to \(\mathbf{Y}\):

    • minimize the length of \(\mathbf{e} = \mathbf{Y} - \mathbf{\widehat{Y}}\), where \(\mathbf{e}\) is the vector of residuals (prediction errors) \((Y_1 - \widehat{Y}_1, \ldots , Y_i - \widehat{Y}_i, \ldots, Y_n - \widehat{Y}_n)\)
  • … find that the sum of squared residuals is minimized when \(\mathbf{e}'\mathbf{X} = \mathbf{0}\): residuals are orthogonal to \(\mathbf{X}\).

This gave us \(\mathbf{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)
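
As a quick check of this formula, here is a minimal R sketch (with simulated data, since no dataset has been introduced yet) verifying that \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\) reproduces the coefficients from lm():

# Minimal sketch with simulated data (illustrative only)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Design matrix with a column of 1s for the intercept
X <- cbind(1, x1, x2)

# beta = (X'X)^{-1} X'Y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

# Matches the least squares fit from lm()
coef(lm(y ~ x1 + x2))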

Least Squares for Prediction

This definition of least squares means that:

  • residuals \(\mathbf{e}\) are orthogonal to \(\mathbf{X}\); so if there is an intercept, the mean of the residuals is \(0\), and covariance of residuals and \(\mathbf{X}\) is \(0\).

  • If we just fit an intercept (same value of \(\widehat{Y}\) for all cases): what is the mean of the residuals \(E(Y_i - \widehat{Y_i})\)?

Return to Multivariate

Let’s say we want to get the conditional expectation function for yearly earnings of professionals as a function of hours worked, profession, gender, and age.

Using data from the American Community Survey, we can look at how hours worked is related to earnings for doctors and lawyers.

  • UHRSWORK (the usual hours worked per week)
  • FEMALE (an indicator for female)
  • AGE (in years)
  • LAW (an indicator for being a lawyer)
  • INCEARN (total annual income earned in dollars).

Return to Multivariate

Let’s now find the (approximate) linear conditional expectation function of earnings, across hours worked, age, profession, and gender:

Example

## 
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -291485  -89191  -28271   71281 1067525 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9929.8     9370.4    1.06    0.289    
## UHRSWORK      1245.4      111.9   11.13   <2e-16 ***
## AGE           3311.3      125.0   26.50   <2e-16 ***
## LAW         -49672.0     2647.1  -18.76   <2e-16 ***
## FEMALE      -42281.4     2811.5  -15.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 127300 on 9995 degrees of freedom
## Multiple R-squared:  0.1502, Adjusted R-squared:  0.1498 
## F-statistic: 441.6 on 4 and 9995 DF,  p-value: < 2.2e-16

How do we make sense of, e.g., the slope on AGE?

Example

In your small groups:

  1. could we do better at approximating the conditional expectation function, e.g., of hours worked by age? (One possible approach is sketched after the code below.)

  2. How would you do it?

  3. How would this change the interpretation of the coefficient on UHRSWORK?

ls = lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, acs_data)
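
One possible way to answer question 1 (a sketch only, assuming acs_data is loaded as above; the quadratic terms are just one choice among many) is to make the fit more flexible in hours worked and age:

# A more flexible specification (illustrative): quadratic terms in
# hours worked and age, so the fitted CEF need not be a straight line
ls_flex <- lm(INCEARN ~ UHRSWORK + I(UHRSWORK^2) + AGE + I(AGE^2) +
                LAW + FEMALE, data = acs_data)
summary(ls_flex)

# The "slope" on hours worked is no longer a single number: the predicted
# change in earnings now depends on the level of UHRSWORK itself.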

Inference

Key Points:

  1. We want to interpret the \(\beta\) vector causally.

  2. There is a true \(\beta_D\) (\(\delta\)) that is the effect of some \(D\) on \(Y\).

  3. \(\beta_D\) is an unknown parameter.

  4. To estimate \(\beta_D\), we need a statistical model for least squares.

A Statistical Model

As with experiments (e.g. the Neyman causal model), statistical models

  • treat observed data as realizations of random variables (some chance process)
  • with some assumptions…
  • permit us to connect what we observe to unobserved parameters

Need to clarify what work is being done by

  • data
  • regression mathematics
  • assumptions of statistical model

An example: Hooke’s Law

Hooke’s law is a law of physics that states that the force (\(D\)) needed to extend or compress a spring by some distance \(y\) increases linearly with respect to the distance.

Thus, for a given weight \(D_i\) on a spring, the length \(Y_i\) is:

\[Y_i = \alpha + \beta D_i + \epsilon_i\] for \(i\) in \(1 \ldots n\)

An example: Hooke’s Law

\[Y_i = \alpha + \beta D_i + \epsilon_i\] for \(i\) in \(1 \ldots n\)

This is a statistical model:

  • \(\alpha\) and \(\beta\) are unknown parameters (thus, greek letters, not \(a\) and \(b\)). They are fixed.
  • \(\epsilon_i\) is a random error (a random variable). Each \(\epsilon_i\) is…

    • independent of \(D_i\)
    • drawn from the same distribution with mean \(0\) and variance \(\sigma^2\) (another parameter)

We use observed data to estimate the parameters \(\alpha, \beta, \sigma^2\)

In groups:

Imagine we have a spring whose length is 30 cm. With each Newton of force applied, the spring increases by 0.5 cm.

Using these facts and the statistical model described above: create a response schedule/potential outcomes table corresponding to the spring under loads of \(0,1,2,3,5,8\) Newtons.
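
For reference, a minimal R sketch of what that response schedule might look like (the error standard deviation is arbitrary, chosen only for illustration):

# Response schedule for the spring example (sketch):
# length = 30 cm intercept, 0.5 cm per Newton, plus random error
set.seed(1)
D       <- c(0, 1, 2, 3, 5, 8)         # loads in Newtons
alpha   <- 30                          # cm, spring length with no load
beta    <- 0.5                         # cm per Newton
sigma   <- 0.1                         # illustrative error SD (assumed)
epsilon <- rnorm(length(D), mean = 0, sd = sigma)

Y <- alpha + beta * D + epsilon        # observed lengths
data.frame(D, deterministic = alpha + beta * D, epsilon, Y)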

The Regression Model:

The statistical model for the regression makes certain assumptions about the process generating the data.

The statistical model assumes that this linear equation describes the process generating the data.

\[Y_i = \alpha + \beta D_i + \epsilon_i\]

deterministic (not random): \(\alpha + \beta D_i\)

stochastic (random error): \(\epsilon_i\)

The Regression Model:

\(\alpha\): parameter for the intercept (fixed for all cases)

\(\beta\): parameter for the slope (fixed for all cases)

\(D_i\): value of independent variable (input into the equation)

\(Y_i\): value of dependent variable (determined by the equation)

\(\epsilon_i\): random error for observation \(i\)

  • each \(\epsilon_i\) is a realization of the same random variable, with mean \(0\) and variance \(\sigma^2\).
  • the \(\epsilon_i\) are independent of each other and of \(D_i\)

The Regression Model:

The more general regression model (where \(\mathbf{X_i}\) is the row vector of predictor values for observation \(i\))

\[Y_i = \mathbf{X_i\beta} + \epsilon_i\]

can be estimated with least squares:

\[Y_i = X_i\hat{\beta} + e_i\]

where the variance of the residuals \(e\) gives us the estimate \(\widehat{\sigma^2}\)

The Regression Model:

Differences from Least Squares Algorithm

  • we now have estimates (marked with hats: \(\widehat{\beta}\), \(\widehat{\sigma^2}\)) of the parameters
  • we now distinguish the unobserved errors \(\epsilon_i\) from the residuals \(e_i\).

Least Squares is the Best Linear Unbiased Estimator of our model parameters under 3 key assumptions.

  • unbiased: \(\beta - E(\widehat{\beta}) = 0\)
  • “best” as in estimate has least variance (not so important in practice)

Model Assumptions:

\((1)\) Model equation is correctly specified

  • correct functional form (e.g., linear) for process by which variables in \(X\) affect \(Y\), and how variables in \(X\) affect each other.
  • all relevant variables affecting \(Y\) (will define later) are included in the model
  • \(Y_i = \mathbf{X_i}\mathbf{\beta} + \epsilon_i\) is the true data-generating function

In groups: come up with 2 ways this assumption could be wrong in the context of this regression: lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, acs_data)

Model Assumptions:

\((2)\) \(\epsilon_i\) are independent, identically distributed (i.i.d.)

  • \(E(\epsilon_i) = 0\) and \(Var(\epsilon_i) = \sigma^2\).
  • homoskedastic vs. heteroskedastic

In groups: come up with 2 ways this assumption could be wrong in the context of the Hooke’s Law example.

Model Assumptions:

\((3)\) \(\epsilon_i\) is independent of \(X_i\): \(X_i \perp \!\!\! \perp \epsilon_i\)

  • This means \(X\) is independent of \(\epsilon\) (recall dependence versus independence)
  • independence is not orthogonality

In groups: come up with 2 ways this assumption could be wrong in the context of this regression: lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, acs_data)

Model Assumptions:

Under these assumptions we call the model “ordinary least squares”

We cannot empirically verify these assumptions.

In particular, cannot verify assumption 3:

  • When using least squares to estimate this model, we mathematically impose that the residuals \(e_i\) are orthogonal to \(x_i\) and that \(E(e_i) = 0\). This is a mathematical feature of regression, not proof that \(X_i \perp \!\!\! \perp \epsilon_i\) (see the sketch below).
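
A small simulated sketch (hypothetical data) of this point: even when the true error depends on \(x\) (assumption 3 fails), the least squares residuals are orthogonal to \(x\) by construction, and the slope estimate is biased.

# Sketch: assumption 3 fails, yet residuals are still orthogonal to x
set.seed(2)
n <- 1000
z <- rnorm(n)                         # confounder
x <- z + rnorm(n)                     # x depends on z
y <- 1 + 2 * x + 3 * z + rnorm(n)     # true process includes z

fit <- lm(y ~ x)                      # z is omitted, so the error (3*z + noise)
                                      # is NOT independent of x
coef(fit)["x"]                        # biased: noticeably larger than 2

# Yet the fitted residuals are orthogonal to x and have mean ~0 anyway:
sum(resid(fit) * x)                   # ~0 (up to numerical error)
mean(resid(fit))                      # ~0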

Model Assumptions:

If these assumptions are reasonable for the data we observe, then with large \(n\), we can estimate our parameters \(\beta, \sigma^2\) using least squares.

Estimating the Regression Model

The Regression Model:

Mathematical Assumptions (least squares):

  • design matrix has full rank (no linear dependence between columns)
  • \(n \geq p\)

Statistical Assumptions (statistical model):

  • \(Y_i\) is generated by \(\mathbf{X_i\beta} + \epsilon_i\)
  • \(\epsilon_i\) are independent, identically distributed, with variance \(\sigma^2\)
  • \(X_i \perp \!\!\! \perp \epsilon_i\)

Estimation:

If \(\mathbf{X}\) is an \(n \times p\) matrix: we want to estimate \(p + 1\) parameters… (what are they?)

\(\beta\) estimated by \(\widehat{\beta}\) using least squares:

\[\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]

  • Note: \(\widehat{\beta}\) is a random vector; each element is a random variable. Why?

Estimation:

Under the statistical assumption we made, \(\widehat{\beta}\) is an unbiased estimator of \(\beta\), conditional on \(X\): \(E(\widehat{\beta}|X) = \beta\)

  • “conditional on \(X\)” means taking the values of \(X\) as fixed

LS estimate is unbiased:

By assumption: \(Y = \mathbf{X\beta} + \epsilon\), so:

\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)

\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X\beta} + \epsilon)\)

\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)

\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)

LS estimate is unbiased:

\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)

So:

\(E(\widehat{\beta}|X) = E(\mathbf{\beta}|\mathbf{X}) + E((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon|\mathbf{X})\)

\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon|\mathbf{X})\)

And because, by assumption, \(X \perp \!\!\! \perp \epsilon\)

\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon)\)

And because, by assumption, \(E(\epsilon) = 0\)

\(E(\widehat{\beta}|X) = \mathbf{\beta}\)
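
A quick simulation sketch of what this result means (the design matrix, true \(\beta\), and \(\sigma\) below are made up for illustration): holding \(\mathbf{X}\) fixed and redrawing \(\epsilon\) many times, the average of \(\widehat{\beta}\) across draws is close to \(\beta\).

# Unbiasedness sketch: hold X fixed, redraw epsilon, average beta-hat
set.seed(3)
n     <- 50
X     <- cbind(1, rnorm(n))           # fixed design matrix
beta  <- c(1, 2)                      # true parameters (assumed)
sigma <- 1

beta_hats <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  as.vector(solve(t(X) %*% X) %*% t(X) %*% y)
})
rowMeans(beta_hats)                   # close to c(1, 2): E(beta-hat | X) = beta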

Variance of LS Estimator:

Thus, if our model assumptions are correct, \(\widehat{\beta}\) is an unbiased estimate of parameter \(\beta\)

  • \(\widehat{\beta}\) is a random vector, as it depends on chance draws of \(\epsilon\)
  • As \(n \to \infty\), \(\widehat{\beta}\) converges to \(\beta\), but in any finite sample there is still uncertainty.

We need a way to quantify the uncertainty in our estimate \(\widehat{\beta}\)

Variance of our Estimate

Variance of OLS Estimator:

  • Each element of \(\widehat{\beta}_{p \times 1}\) is a random variable; each has a variance and covariances with the other elements
  • These are collected in the variance-covariance matrix of \(\widehat{\beta}\):

\[\scriptsize{\begin{pmatrix} Var(\widehat{\beta}_1) & Cov(\widehat{\beta}_2,\widehat{\beta}_1) & Cov(\widehat{\beta}_3,\widehat{\beta}_1) &\ldots & Cov(\widehat{\beta}_p,\widehat{\beta}_1) \\ Cov(\widehat{\beta}_1,\widehat{\beta}_2) & Var(\widehat{\beta}_2) & Cov(\widehat{\beta}_3,\widehat{\beta}_2) & \ldots & Cov(\widehat{\beta}_p,\widehat{\beta}_2) \\ Cov(\widehat{\beta}_1,\widehat{\beta}_3) & Cov(\widehat{\beta}_2,\widehat{\beta}_3) & Var(\widehat{\beta}_3) & \ldots & Cov(\widehat{\beta}_p,\widehat{\beta}_3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Cov(\widehat{\beta}_1,\widehat{\beta}_p) & Cov(\widehat{\beta}_2,\widehat{\beta}_p) & Cov(\widehat{\beta}_3, \widehat{\beta}_p) & \ldots & Var(\widehat{\beta}_p)\end{pmatrix}}\]

  • Diagonal elements are the variances, off-diagonal elements are covariances; matrix is symmetric

Variance of OLS Estimator:

How do we use the variance-covariance matrix?

  • The square root of each diagonal element (a variance) gives the standard error of the corresponding estimate in \(\widehat{\beta}\) (used for hypothesis testing)

  • The off-diagonal elements help answer questions like whether \(\beta_2 + \beta_3 \neq 0\): we need \(Cov(\widehat{\beta}_2, \widehat{\beta}_3)\) to get \(Var(\widehat{\beta}_2 + \widehat{\beta}_3)\) (e.g., for interaction effects). A sketch follows below.
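
For example, a sketch of how we might do this in R using the fit ls from earlier (the particular pair of coefficients summed here, AGE and LAW, is chosen only for illustration):

# Standard errors from the diagonal of the variance-covariance matrix
V <- vcov(ls)
sqrt(diag(V))                       # standard errors of each coefficient

# Variance (and SE) of a sum of two coefficients:
# Var(b1 + b2) = Var(b1) + Var(b2) + 2 * Cov(b1, b2)
v_sum <- V["AGE", "AGE"] + V["LAW", "LAW"] + 2 * V["AGE", "LAW"]
sqrt(v_sum)                         # SE of (beta_AGE + beta_LAW)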

Variance-Covariance Matrix

\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\), So:

\[cov(\widehat{\beta}|X) = E((\widehat{\beta} - \beta)(\widehat{\beta} - \beta)' | X)\]

\[ = E( ((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)' | X)\]

\[ = E( ((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon)(\epsilon'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}) | X)\]

\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

What really matters here is: \(E(\epsilon\epsilon'|X)\)

Variance-Covariance Matrix

All variance-covariance matrices are “sandwiches”:

\[ (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is the bread; \(E(\epsilon\epsilon'|X)\) is the meat. We can make different choices of “meat”.

Variance-Covariance Matrix

In groups: if \(\epsilon\) is a vector that is \(n \times 1\) with elements \((\epsilon_1 \ldots \epsilon_n)\)

  1. What are the dimensions of the matrix: \(\epsilon\epsilon'\)?

  2. What are the elements on the diagonal of the matrix?

  3. What is the expected value of the elements on the off-diagonal?

Hints: We have assumed that \(E(\epsilon) = 0\); \(\epsilon_i\) are independent of each other.

Variance-Covariance Matrix

\[E(\epsilon\epsilon'|X) = \sigma^2\mathbf{I}_{n\times n}\]

\(\epsilon\epsilon'\) is an \(n \times n\) matrix with \(ij\)th element \(\epsilon_i\epsilon_j\)

  • this is the variance covariance matrix of the errors (we don’t subtract means because \(E(\epsilon) = 0\))

Variance-Covariance Matrix

Taking the expectation:

  • If \(i \neq j\), the elements of the matrix are \(E(\epsilon_i\epsilon_j)\). Because, by assumption, \(E(\epsilon) = 0\), and \(\epsilon\)s are independent, \(E(\epsilon_i\epsilon_j) = E(\epsilon_i)E(\epsilon_j) = 0 \times 0 = 0\)
  • but, for \(i = j\), \(E(\epsilon_i^2) = E[(\epsilon_i - E(\epsilon_i))^2]\) (a variance, since \(E(\epsilon_i) = 0\))
  • and because the \(\epsilon_i\) are identically distributed, \(Var(\epsilon_i) = \sigma^2\) for all \(i\)

Variance-Covariance Matrix

\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\sigma^2\mathbf{I}_{n\times n}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

\[ = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\] \[cov(\widehat{\beta}|X) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\]

This gives us the variance-covariance matrix of \(\widehat{\beta}\). Can we start computing standard errors for \(\widehat{\beta}\) now?

Variance-Covariance Matrix

We use residuals to stand in for \(\epsilon\).

\[\widehat{\sigma^2} = \frac{\sum\limits_i^n e_i^2}{n - p}\]

\(E[\widehat{\sigma^2} | X] = \sigma^2\) (unbiased)

  • Why \(n-p\)? The variance of the sample mean required dividing by \(n-1\), because we estimated the mean first, so we lost 1 degree of freedom. With bivariate regression (intercept and slope), we lose 2 degrees of freedom. With \(p\) columns in \(\mathbf{X}\), we lose \(p\) degrees of freedom.

Variance-Covariance Matrix

so, we estimate the variance-covariance matrix:

\(\widehat{cov}(\widehat{\beta}) = \widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\)
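
As a sketch of how these pieces fit together in R (again using the earlier fit ls), this reproduces the standard errors that summary() reports:

# Estimate sigma^2 from the residuals, then the variance-covariance matrix
X <- model.matrix(ls)                      # design matrix (n x p)
n <- nrow(X); p <- ncol(X)

sigma2_hat <- sum(resid(ls)^2) / (n - p)   # matches summary(ls)$sigma^2
V_hat      <- sigma2_hat * solve(t(X) %*% X)

sqrt(diag(V_hat))                          # standard errors of beta-hat
coef(summary(ls))[, "Std. Error"]          # same values from summary()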

Standard Errors of \(\widehat{\beta}\)

The variance of \(\hat{\beta}_p\) (the corresponding diagonal element of \(\widehat{cov}(\widehat{\beta})\)) is

\(Var(\hat{\beta}_p) = \frac{\widehat{\sigma^2}}{\sum\limits_i^n (X^*_{i} - \overline{X^*})^2}\)

where \(X^*\) is the residual variation in \(X_p\) (the part not explained by the other predictors).

So the standard error is:

\(SE(\hat{\beta_p}) = \sqrt{Var(\hat{\beta_p})} = \frac{\widehat{\sigma}}{\sqrt{n}\sqrt{Var(X_p^*)}}\)

  • standard errors get smaller (more precise): with growing \(n\), increased (residual) variation in \(X_p\), smaller error variance (\(\sigma^2\))

Sampling Distribution of \(\hat{\beta}\)

asymptotic normality:

Even if \(\epsilon\) is not normally distributed, under our assumptions the sampling distribution of \(\hat{\beta}\) approaches a normal distribution as \(n \to \infty\), by the central limit theorem. This approximation is reasonable if \(n\) is moderately large and the data are not highly skewed.

Because we estimate the standard errors, the \(t\) statistic used for hypothesis testing of \(\widehat{\beta}\) is (asymptotically) \(t\) distributed.
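
As a sketch (using the earlier fit ls): the t statistics and p-values reported by summary() are just the estimates divided by their standard errors, compared to a \(t\) distribution with \(n - p\) degrees of freedom.

# t statistics and p-values by hand (match summary(ls))
b  <- coef(ls)
se <- sqrt(diag(vcov(ls)))
t_stat <- b / se                                  # test of H0: beta = 0
p_val  <- 2 * pt(abs(t_stat), df = df.residual(ls), lower.tail = FALSE)
cbind(estimate = b, se = se, t = t_stat, p = p_val)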

Standard errors

In sum:

  • We get \(\widehat{\sigma^2}\) from the residuals (\(e\)): \(\widehat{\sigma^2} = \sum_i e_i^2 / (n - p)\)
  • We take the diagonal of the variance-covariance matrix \(\widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\) to get the variances of the slopes
  • Take the square root of the variances to get standard errors for \(\widehat{\beta}\)

Standard errors

For standard errors of \(\widehat{\beta}\) to be correct, we must assume:

  1. \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  2. \(\epsilon_i\) are independent, identically distributed, with variance \(\sigma^2\) for all \(i\)
  3. \(X_i \perp \!\!\! \perp \epsilon_i\)

If any assumption is wrong: standard errors will be incorrect.

If assumption 1 or 3 is wrong: unbiasedness of \(\widehat{\beta}\) does not hold.

In groups:

  1. \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  2. \(\epsilon_i\) are independent, identically distributed, with variance \(\sigma^2\) for all \(i\)
  3. \(X_i \perp \!\!\! \perp \epsilon_i\)

What is an example of each assumption being wrong?

Why would it lead our standard errors to be wrong?

OLS Assumptions:

When do things go wrong?

  • We incorrectly specify the model (e.g., it is multiplicative, or we left out a term):
    • \(Y_i = \alpha + \beta_1 x + \beta_2 x^2 + \epsilon\)
    • \(Y_i = \alpha + \beta_1 x + \beta_2 z + \epsilon\)
  • Errors are heteroskedastic: errors have different variances.
    • E.g.: at higher levels of education, incomes take on a wider range of values (think of the earning potential of PhD vs. MBA students), so the errors are more spread out. (A sketch follows below.)
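
A small simulation sketch (hypothetical data) of why this matters: with heteroskedastic errors, the classical formula \(\widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\) can be far from the actual sampling variability of \(\widehat{\beta}\). In this particular setup the classical standard error understates the true spread; in other setups it can overstate it.

# Sketch: heteroskedastic errors => classical SE misses the true spread
set.seed(4)
n <- 500
sim <- replicate(2000, {
  x <- runif(n, 0, 10)
  e <- rnorm(n, sd = x^2 / 10)        # error spread grows with x
  y <- 1 + 2 * x + e
  f <- lm(y ~ x)
  c(est = coef(f)["x"], se = sqrt(diag(vcov(f)))["x"])
})
sd(sim["est.x", ])      # actual sampling SD of the slope across simulations
mean(sim["se.x", ])     # average classical SE: smaller than the SD above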

OLS Assumptions:

  • Errors are not independent: errors for different observations are correlated.
    • E.g.: we observe the same counties over time. The error we observe for county \(i\) at time \(t\) is likely to be correlated with the error for county \(i\) at time \(t+1\), because it is the same place.
  • \(\epsilon\) is not independent of \(X\): the errors are in fact correlated with \(X\).
    • This is “confounding”. Our estimate \(\hat{\beta}\), the coefficient on \(x\), will be biased. If we incorrectly excluded \(z\) from the equation, the bias in the coefficient on \(x\) will be a function of the relationships between \(x\) and \(z\), and between \(z\) and \(y\).

Conclusion:

  1. \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  2. \(\epsilon_i\) are independent, identically distributed, with variance \(\sigma^2\) for all \(i\)
  3. \(X_i \perp \!\!\! \perp \epsilon_i\)

With these assumptions:

  • \(\widehat{\beta}\) is unbiased estimator for \(\beta\)
  • known sampling distribution of \(\widehat{\beta}\)